perf : Optimize count distinct using bitmaps instead of hashsets for smaller datatypes by coderfender · Pull Request #21456 · apache/datafusion

coderfender · 2026-04-08T08:25:40Z

Which issue does this PR close?

Replace the hashset-based accumulators for the smaller integer data types with bitmaps. Follow-up of #21453.

Closes #21488 (Use bitmap for count_distinct expression for u8/16 and i8/16 [perf])
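A minimal sketch of the idea, not the code from this PR: for `u8` there are only 256 possible values, so a distinct count can be kept in 256 bits instead of a hash set. The names `U8Bitmap`, `count_distinct_bitmap`, and `count_distinct_hashset` are hypothetical and only illustrate the technique.

```rust
use std::collections::HashSet;

/// Hypothetical bitmap for distinct `u8` values: 256 bits stored in four `u64` words.
#[derive(Default)]
struct U8Bitmap {
    words: [u64; 4],
}

impl U8Bitmap {
    /// Mark `v` as seen by setting bit number `v`.
    fn insert(&mut self, v: u8) {
        self.words[(v >> 6) as usize] |= 1u64 << (v & 63);
    }

    /// Merge another partial state with a bitwise OR.
    fn merge(&mut self, other: &U8Bitmap) {
        for (a, b) in self.words.iter_mut().zip(other.words.iter()) {
            *a |= *b;
        }
    }

    /// Number of distinct values seen so far (population count of all words).
    fn count(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }
}

/// Baseline for comparison: distinct count through a hash set.
fn count_distinct_hashset(values: &[u8]) -> usize {
    values.iter().copied().collect::<HashSet<u8>>().len()
}

/// Same result through the bitmap: no hashing, no allocation.
fn count_distinct_bitmap(values: &[u8]) -> u32 {
    let mut bitmap = U8Bitmap::default();
    for &v in values {
        bitmap.insert(v);
    }
    bitmap.count()
}
```

Both functions return the same count for the same input; the bitmap version replaces hashing and probing with one shift and one OR per value.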

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

coderfender · 2026-04-08T08:26:42Z

Benchmark results (bitmap branch on the left, main on the right):

count_distinct i16 bitmap                      1.00      3.3±0.43µs        ? ?/sec    23.87    78.4±0.84µs        ? ?/sec
count_distinct i8 bitmap                       1.00      2.3±0.49µs        ? ?/sec    7.13     16.7±0.55µs        ? ?/sec
count_distinct u16 bitmap                      1.00      3.1±0.18µs        ? ?/sec    25.45    78.8±3.92µs        ? ?/sec
count_distinct u8 bitmap                       1.00      2.3±0.34µs        ? ?/sec    7.37     16.9±0.14µs        ? ?/sec

It seems like we are 25x faster for u16 bitmap based accumulators (or I am sleepy :) )

Dandandan · 2026-04-08T09:11:51Z

I think we can do the same for 16-bit types: it is just 65_536 bytes, or 8192 bytes if we use a bitmap.
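The size arithmetic can be checked with a small sketch. `Bitmap16` and the two helper functions are hypothetical names, and reusing one bitmap for `i16` through its bit pattern is an assumption about how signed values could be mapped, not a statement about the PR's code.

```rust
/// Hypothetical fixed-size bitmap with one bit per possible 16-bit value.
struct Bitmap16 {
    // 1024 words * 8 bytes = 8192 bytes = 65_536 bits.
    words: Box<[u64; 1024]>,
}

impl Bitmap16 {
    fn new() -> Self {
        Self { words: Box::new([0u64; 1024]) }
    }

    /// Set the bit for `v`: the word index is `v / 64`, the bit index is `v % 64`.
    fn insert_u16(&mut self, v: u16) {
        self.words[(v >> 6) as usize] |= 1u64 << (v & 63);
    }

    /// Assumption: `i16` values are mapped to slots through their bit pattern.
    fn insert_i16(&mut self, v: i16) {
        self.insert_u16(v as u16);
    }

    /// Number of distinct values seen so far.
    fn count(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }
}

/// Size of the bitmap representation in bytes.
fn bitmap_bytes() -> usize {
    std::mem::size_of::<[u64; 1024]>()
}

/// Size of a one-byte-per-value table in bytes.
fn byte_per_value_bytes() -> usize {
    std::mem::size_of::<[bool; 65_536]>()
}
```

With one byte per value the table takes 65_536 bytes; packing eight values per byte gives 65_536 / 8 = 8192 bytes.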

Dandandan · 2026-04-08T09:12:46Z

Oh wait, you're already doing that :)

coderfender · 2026-04-08T21:38:57Z

Query 0 in the clickbench_extended dataset (which uses count distinct on u8) is now ~11% faster:

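┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃    main_cb ┃ bitmap_cb_2 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │  529.46 ms │   478.99 ms │ +1.11x faster │
└───────────┴────────────┴─────────────┴───────────────┘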

(Other queries are also faster, but I believe that is mostly variance.)

┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃    main_cb ┃ bitmap_cb_2 ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │  529.46 ms │   478.99 ms │ +1.11x faster │
│ QQuery 1  │  107.43 ms │   102.59 ms │     no change │
│ QQuery 2  │  250.89 ms │   240.76 ms │     no change │
│ QQuery 3  │  207.67 ms │   207.49 ms │     no change │
│ QQuery 4  │  391.43 ms │   353.05 ms │ +1.11x faster │
│ QQuery 5  │ 4144.11 ms │  4084.08 ms │     no change │
│ QQuery 6  │  676.03 ms │   622.21 ms │ +1.09x faster │
│ QQuery 7  │  719.78 ms │   599.06 ms │ +1.20x faster │
│ QQuery 8  │  238.30 ms │   207.27 ms │ +1.15x faster │
│ QQuery 9  │ 1531.52 ms │  1406.34 ms │ +1.09x faster │
│ QQuery 10 │  435.27 ms │   403.28 ms │ +1.08x faster │
│ QQuery 11 │ 1043.68 ms │   955.22 ms │ +1.09x faster │
│ QQuery 12 │  113.16 ms │   106.31 ms │ +1.06x faster │
└───────────┴────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_cb)       │ 10388.73ms │
│ Total Time (bitmap_cb_2)   │  9766.66ms │
│ Average Time (main_cb)     │   799.13ms │
│ Average Time (bitmap_cb_2) │   751.28ms │
│ Queries Faster             │          9 │
│ Queries Slower             │          0 │
│ Queries with No Change     │          4 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘

coderfender · 2026-04-08T21:41:04Z

cc: @neilconway, @alamb, @martin-g. Please take a look whenever you get a chance.

alamb

This looks like a great idea. Thank you @coderfender

alamb · 2026-04-09T20:25:40Z

run benchmark count_distinct

alamb · 2026-04-09T20:26:12Z

run benchmark clickbench

adriangbot · 2026-04-09T20:28:17Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4217267905-1036-wmjln 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing optimize_count_distinct (a13bcaa) to fbdf770 (merge-base) diff
BENCH_NAME=count_distinct
BENCH_COMMAND=cargo bench --features=parquet --bench count_distinct
BENCH_FILTER=
Results will be posted here when complete

File an issue against this benchmark runner

adriangbot · 2026-04-09T20:28:18Z

Benchmark for this request failed.

Last 20 lines of output:

rustc 1.94.1 (e408947bf 2026-03-25)
a13bcaad5b82e69257e82b741cf72619da838990
fbdf7703a96408b4eba27801431be8bf468734d8
error: failed to load manifest for workspace member `/workspace/datafusion-branch/datafusion/catalog`
referenced by workspace at `/workspace/datafusion-branch/Cargo.toml`

Caused by:
  failed to load manifest for dependency `datafusion-datasource`

Caused by:
  failed to load manifest for dependency `datafusion-physical-plan`

Caused by:
  failed to load manifest for dependency `datafusion-functions-aggregate`

Caused by:
  failed to parse manifest at `/workspace/datafusion-branch/datafusion/functions-aggregate/Cargo.toml`

Caused by:
  found duplicate bench name count_distinct, but all bench targets must have a unique name

adriangbot · 2026-04-09T20:28:35Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4217270691-1037-8wdsm 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing optimize_count_distinct (a13bcaa) to fbdf770 (merge-base) diff
BENCH_NAME=clickbench
BENCH_COMMAND=cargo bench --features=parquet --bench clickbench
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-04-09T20:28:36Z

Benchmark for this request failed.

Last 20 lines of output:

rustc 1.94.1 (e408947bf 2026-03-25)
a13bcaad5b82e69257e82b741cf72619da838990
fbdf7703a96408b4eba27801431be8bf468734d8
error: failed to load manifest for workspace member `/workspace/datafusion-branch/datafusion/catalog`
referenced by workspace at `/workspace/datafusion-branch/Cargo.toml`

Caused by:
  failed to load manifest for dependency `datafusion-datasource`

Caused by:
  failed to load manifest for dependency `datafusion-physical-plan`

Caused by:
  failed to load manifest for dependency `datafusion-functions-aggregate`

Caused by:
  failed to parse manifest at `/workspace/datafusion-branch/datafusion/functions-aggregate/Cargo.toml`

Caused by:
  found duplicate bench name count_distinct, but all bench targets must have a unique name

coderfender · 2026-04-10T21:38:23Z

Recent benchmarks:

(Not sure if this is due to variance in the build environment.)

group                              bitmap_count_distinct                  main
-----                              ---------------------                  ----
count_distinct i16 bitmap          1.00      3.1±0.27µs        ? ?/sec    25.69    80.7±0.62µs        ? ?/sec
count_distinct i64 80% distinct    1.00     48.7±0.49µs        ? ?/sec    1.00     48.9±1.01µs        ? ?/sec
count_distinct i64 99% distinct    1.00     49.0±1.97µs        ? ?/sec    1.04     51.0±3.38µs        ? ?/sec
count_distinct i8 bitmap           1.00      2.2±0.18µs        ? ?/sec    7.60     17.0±0.16µs        ? ?/sec
count_distinct u16 bitmap          1.00      3.1±0.17µs        ? ?/sec    25.99    81.4±0.84µs        ? ?/sec
count_distinct u8 bitmap           1.00      2.0±0.01µs        ? ?/sec    8.42     17.2±0.17µs        ? ?/sec

alamb · 2026-04-10T21:56:53Z

run benchmark count_distinct

adriangbot · 2026-04-10T21:59:26Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4227023688-1062-bkn4c 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing optimize_count_distinct (61dc8e1) to eaf0a41 (merge-base) diff
BENCH_NAME=count_distinct
BENCH_COMMAND=cargo bench --features=parquet --bench count_distinct
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-04-10T22:03:00Z

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                              main                                   optimize_count_distinct
-----                              ----                                   -----------------------
count_distinct i16 bitmap          17.96   164.7±0.26µs        ? ?/sec    1.00      9.2±0.73µs        ? ?/sec
count_distinct i64 80% distinct    1.00    102.2±0.40µs        ? ?/sec    1.11    113.0±0.47µs        ? ?/sec
count_distinct i64 99% distinct    1.00    102.2±0.35µs        ? ?/sec    1.12    114.9±0.32µs        ? ?/sec
count_distinct i8 bitmap           5.32     31.1±0.11µs        ? ?/sec    1.00      5.8±0.00µs        ? ?/sec
count_distinct u16 bitmap          26.41   157.3±2.49µs        ? ?/sec    1.00      6.0±0.20µs        ? ?/sec
count_distinct u8 bitmap           5.22     30.9±0.03µs        ? ?/sec    1.00      5.9±0.07µs        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	59.8s
Peak memory	3.5 GiB
Avg memory	3.5 GiB
CPU user	74.0s
CPU sys	1.0s
Peak spill	0 B

branch

Metric	Value
Wall time	53.9s
Peak memory	3.5 GiB
Avg memory	3.5 GiB
CPU user	68.8s
CPU sys	0.2s
Peak spill	0 B

coderfender · 2026-04-10T22:10:31Z

Okay, this still seems to be an issue. Let me see whether adding hints for the compiler keeps the existing i64 hot paths from regressing.

…inct' into optimize_count_distinct

adriangbot · 2026-04-11T02:43:49Z

Hi @coderfender, thanks for the request (#21456 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas.

alamb · 2026-04-11T09:46:46Z

run benchmark count_distinct

adriangbot · 2026-04-11T09:47:09Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4229213096-1078-nmk9t 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing optimize_count_distinct (554f60c) to eaf0a41 (merge-base) diff
BENCH_NAME=count_distinct
BENCH_COMMAND=cargo bench --features=parquet --bench count_distinct
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-04-11T09:52:43Z

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                              main                                   optimize_count_distinct
-----                              ----                                   -----------------------
count_distinct i16 bitmap          23.98   164.8±0.31µs        ? ?/sec    1.00      6.9±0.26µs        ? ?/sec
count_distinct i64 80% distinct    1.00    101.7±0.43µs        ? ?/sec    1.11    112.9±0.39µs        ? ?/sec
count_distinct i64 99% distinct    1.00    101.9±0.17µs        ? ?/sec    1.13    115.1±0.43µs        ? ?/sec
count_distinct i8 bitmap           7.17     31.1±0.08µs        ? ?/sec    1.00      4.3±0.03µs        ? ?/sec
count_distinct u16 bitmap          15.00   155.6±0.39µs        ? ?/sec    1.00     10.4±0.31µs        ? ?/sec
count_distinct u8 bitmap           7.16     30.8±0.06µs        ? ?/sec    1.00      4.3±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	59.6s
Peak memory	3.2 GiB
Avg memory	3.1 GiB
CPU user	74.1s
CPU sys	0.8s
Peak spill	0 B

branch

Metric	Value
Wall time	58.0s
Peak memory	3.2 GiB
Avg memory	3.1 GiB
CPU user	72.8s
CPU sys	0.2s
Peak spill	0 B

coderfender · 2026-04-12T02:04:03Z

My suspicion is that bloat in the match arms is causing the regression. I am trying out options to reduce that code bloat, and with it the pressure on the CPU cache, to see if that removes the regression.

coderfender · 2026-04-12T02:55:35Z

Pushed a commit that moves the bitmap-based accumulators out into a separate function with a cold hint for the compiler. Benchmarks on my machine show that the i64 path is now slightly faster, so my hunch is that it should match the main branch on the GitHub benchmarks.


group                              bitmap_cold_hint                       main
-----                              ----------------                       ----
count_distinct i16 bitmap          1.00      2.9±0.03µs        ? ?/sec    27.65    80.7±0.62µs        ? ?/sec
count_distinct i64 80% distinct    1.00     46.7±0.22µs        ? ?/sec    1.05     48.9±1.01µs        ? ?/sec
count_distinct i64 99% distinct    1.00     47.3±0.77µs        ? ?/sec    1.08     51.0±3.38µs        ? ?/sec
count_distinct i8 bitmap           1.00   1073.7±7.10ns        ? ?/sec    15.83    17.0±0.16µs        ? ?/sec
count_distinct u16 bitmap          1.00      3.1±0.19µs        ? ?/sec    26.58    81.4±0.84µs        ? ?/sec
count_distinct u8 bitmap           1.00  1083.1±20.24ns        ? ?/sec    15.89    17.2±0.17µs        ? ?/sec
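The cold-path split described in the comment above could look roughly like the following sketch. It is not the code from this PR: `Kind`, `DistinctCounter`, `HashSetCounter`, `U8BitmapCounter`, `make_bitmap_counter`, and `make_counter` are hypothetical names, and feeding `i64` values into the bitmap with a truncating cast is a simplification to keep the example small. Only `#[cold]` and `#[inline(never)]` are real Rust attributes.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the data types the accumulator factory matches on.
#[derive(Clone, Copy)]
enum Kind {
    U8,
    I64,
}

/// Hypothetical minimal accumulator interface.
trait DistinctCounter {
    fn update(&mut self, v: i64);
    fn count(&self) -> usize;
}

/// Hash-set based counter, used for the wide (hot) path.
struct HashSetCounter(HashSet<i64>);

impl DistinctCounter for HashSetCounter {
    fn update(&mut self, v: i64) {
        self.0.insert(v);
    }
    fn count(&self) -> usize {
        self.0.len()
    }
}

/// Bitmap based counter for `u8`: 256 bits in four words.
struct U8BitmapCounter([u64; 4]);

impl DistinctCounter for U8BitmapCounter {
    fn update(&mut self, v: i64) {
        // Simplification for the sketch: truncate to u8 before indexing.
        let b = v as u8;
        self.0[(b >> 6) as usize] |= 1u64 << (b & 63);
    }
    fn count(&self) -> usize {
        self.0.iter().map(|w| w.count_ones() as usize).sum()
    }
}

/// Out-of-line constructor for the bitmap variants.
/// `#[cold]` hints that this path is unlikely; `#[inline(never)]` keeps
/// its body out of the caller so the hot match stays small.
#[cold]
#[inline(never)]
fn make_bitmap_counter() -> Box<dyn DistinctCounter> {
    Box::new(U8BitmapCounter([0; 4]))
}

/// Factory with a small match: the bitmap arm is a single call.
fn make_counter(kind: Kind) -> Box<dyn DistinctCounter> {
    match kind {
        Kind::I64 => Box::new(HashSetCounter(HashSet::new())),
        Kind::U8 => make_bitmap_counter(),
    }
}
```

The design choice is to keep the caller's match arms short: the bitmap construction lives in one function that the compiler is asked not to inline, so the code generated for the i64 path does not grow when the new variants are added.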

Dandandan · 2026-04-12T09:13:01Z

run benchmark count_distinct

adriangbot · 2026-04-12T09:13:27Z

🤖 Criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4231186236-1107-5bcbv 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing optimize_count_distinct (87a9af8) to eaf0a41 (merge-base) diff
BENCH_NAME=count_distinct
BENCH_COMMAND=cargo bench --features=parquet --bench count_distinct
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-04-12T09:19:06Z

🤖 Criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                              main                                   optimize_count_distinct
-----                              ----                                   -----------------------
count_distinct i16 bitmap          23.64   164.6±0.31µs        ? ?/sec    1.00      7.0±0.61µs        ? ?/sec
count_distinct i64 80% distinct    1.03    102.2±0.38µs        ? ?/sec    1.00     99.6±0.32µs        ? ?/sec
count_distinct i64 99% distinct    1.02    102.3±0.25µs        ? ?/sec    1.00     99.9±0.33µs        ? ?/sec
count_distinct i8 bitmap           7.04     31.0±0.06µs        ? ?/sec    1.00      4.4±0.00µs        ? ?/sec
count_distinct u16 bitmap          26.32   155.7±0.29µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
count_distinct u8 bitmap           7.09     30.9±0.05µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	59.7s
Peak memory	3.2 GiB
Avg memory	3.1 GiB
CPU user	74.2s
CPU sys	0.9s
Peak spill	0 B

branch

Metric	Value
Wall time	54.5s
Peak memory	3.2 GiB
Avg memory	3.1 GiB
CPU user	69.5s
CPU sys	0.3s
Peak spill	0 B

File an issue against this benchmark runner

coderfender · 2026-04-12T15:11:45Z

@alamb, @Dandandan the i64 regression now seems to be fixed (i64 is even slightly faster than main) after moving the bitmap accumulation into a separate function call with a compiler hint. Please take a look whenever you get a chance.

group                              main                                   optimize_count_distinct
-----                              ----                                   -----------------------
count_distinct i16 bitmap          23.64   164.6±0.31µs        ? ?/sec    1.00      7.0±0.61µs        ? ?/sec
count_distinct i64 80% distinct    1.03    102.2±0.38µs        ? ?/sec    1.00     99.6±0.32µs        ? ?/sec
count_distinct i64 99% distinct    1.02    102.3±0.25µs        ? ?/sec    1.00     99.9±0.33µs        ? ?/sec
count_distinct i8 bitmap           7.04     31.0±0.06µs        ? ?/sec    1.00      4.4±0.00µs        ? ?/sec
count_distinct u16 bitmap          26.32   155.7±0.29µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
count_distinct u8 bitmap           7.09     30.9±0.05µs        ? ?/sec    1.00      4.4±0.01µs        ? ?/sec
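
For readers following along, here is a minimal sketch of the idea discussed above. It is not the code from this PR. The type `BitmapU16`, its methods, and the helper `count_distinct_i16` are hypothetical names chosen for illustration, and `#[inline(never)]` is only one possible reading of "compiler hint" (the commit names in this thread mention both a never-inline variant and a cold-path variant).

```rust
/// Hypothetical sketch: one bit per possible u16 value.
/// 65,536 bits = 1,024 u64 words = 8 KiB.
struct BitmapU16 {
    words: Vec<u64>,
}

impl BitmapU16 {
    fn new() -> Self {
        Self { words: vec![0u64; 1024] }
    }

    /// Kept out of line so the bitmap loop is not merged into the caller's
    /// code (assumption: this is the kind of hint the comment refers to).
    #[inline(never)]
    fn update(&mut self, values: &[u16]) {
        for &v in values {
            let idx = v as usize;
            // Word index is idx / 64, bit position is idx % 64.
            self.words[idx >> 6] |= 1u64 << (idx & 63);
        }
    }

    /// Merging two partial states is a word-wise OR.
    fn merge(&mut self, other: &Self) {
        for (a, b) in self.words.iter_mut().zip(other.words.iter()) {
            *a |= *b;
        }
    }

    /// The distinct count is the number of set bits.
    fn count(&self) -> u64 {
        self.words.iter().map(|w| u64::from(w.count_ones())).sum()
    }
}

/// Signed values can reuse the same bitmap by reinterpreting the bits as u16.
fn count_distinct_i16(values: &[i16]) -> u64 {
    let mut bitmap = BitmapU16::new();
    let as_unsigned: Vec<u16> = values.iter().map(|&v| v as u16).collect();
    bitmap.update(&as_unsigned);
    bitmap.count()
}
```

Unlike a hash set, this state has a fixed size and needs no hashing or resizing, which is consistent with the large gaps in the u16/i16 rows of the table above.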

coderfender · 2026-04-13T16:11:08Z

@martin-g, pushing the null checks up seems to give mixed results. Let me see whether there are opportunities to avoid the very minor regressions.

group                                                        bitmap_branch_prediction               bitmap_cold_hint                       main
-----                                                        ------------------------               ----------------                       ----
count_distinct i16 bitmap                                    1.00      2.6±0.08µs        ? ?/sec    1.12      2.9±0.03µs        ? ?/sec    31.10    80.7±0.62µs        ? ?/sec
count_distinct i64 80% distinct                              1.02     47.7±0.90µs        ? ?/sec    1.00     46.7±0.22µs        ? ?/sec    1.05     48.9±1.01µs        ? ?/sec
count_distinct i64 99% distinct                              1.02     48.4±1.23µs        ? ?/sec    1.00     47.3±0.77µs        ? ?/sec    1.08     51.0±3.38µs        ? ?/sec
count_distinct i8 bitmap                                     1.07  1144.7±13.00ns        ? ?/sec    1.00   1073.7±7.10ns        ? ?/sec    15.83    17.0±0.16µs        ? ?/sec
count_distinct u16 bitmap                                    1.00      2.6±0.08µs        ? ?/sec    1.17      3.1±0.19µs        ? ?/sec    31.17    81.4±0.84µs        ? ?/sec
count_distinct u8 bitmap                                     1.06  1144.6±55.34ns        ? ?/sec    1.00  1083.1±20.24ns        ? ?/sec    15.89    17.2±0.17µs        ? ?/sec
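
As a rough illustration of what "pushing the null checks up" could mean, the sketch below checks once per batch whether any nulls exist and then picks a loop, instead of testing validity for every element. This is an assumption about the intent, not the PR's code. The function `update_u8` is hypothetical, and a plain `Option<&[bool]>` stands in for an Arrow validity buffer.

```rust
/// Hypothetical sketch: `seen[v]` is true once value `v` has been observed.
/// `validity` is `None` when the batch has no nulls; otherwise
/// `validity[i] == false` marks element `i` as null.
fn update_u8(seen: &mut [bool; 256], values: &[u8], validity: Option<&[bool]>) {
    match validity {
        // No nulls in this batch: tight loop with no per-element branch.
        None => {
            for &v in values {
                seen[v as usize] = true;
            }
        }
        // Nulls present: check each element's validity before recording it.
        Some(valid) => {
            for (&v, &ok) in values.iter().zip(valid) {
                if ok {
                    seen[v as usize] = true;
                }
            }
        }
    }
}
```

Whether the extra up-front branch pays off depends on batch size and how often nulls appear, which may be one reason the numbers above are mixed.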

parthchandra · 2026-04-13T17:53:39Z

I get this on the latest version of this PR (local run on my Mac):

$ critcmp main count_distinct
group                              count_distinct                         main
-----                              --------------                         ----
count_distinct i16 bitmap          1.00      5.0±0.04µs        ? ?/sec    16.93    84.3±0.86µs        ? ?/sec
count_distinct i64 80% distinct    1.01     52.5±1.45µs        ? ?/sec    1.00     52.0±0.53µs        ? ?/sec
count_distinct i64 99% distinct    1.00     51.6±0.54µs        ? ?/sec    1.00     51.8±0.15µs        ? ?/sec
count_distinct i8 bitmap           1.00  1558.2±24.06ns        ? ?/sec    12.63    19.7±0.46µs        ? ?/sec
count_distinct u16 bitmap          1.00      5.1±0.18µs        ? ?/sec    16.40    83.7±0.43µs        ? ?/sec
count_distinct u8 bitmap           1.00  1578.5±35.24ns        ? ?/sec    12.66    20.0±0.10µs        ? ?/sec

coderfender · 2026-04-14T07:11:54Z

Thank you for the approval @Dandandan :)

alamb

Thank you everyone

coderfender · 2026-04-14T20:28:57Z

Thank you for the approval @alamb

github-actions bot added the functions Changes to functions implementation label Apr 8, 2026

alamb added the performance Make DataFusion faster label Apr 9, 2026

alamb reviewed Apr 9, 2026

Comment thread datafusion/functions-aggregate/Cargo.toml

alamb reviewed Apr 9, 2026

Comment thread datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs Outdated

coderfender mentioned this pull request Apr 9, 2026

chore: create benches small ints for count_distinct #21521

Merged

coderfender force-pushed the optimize_count_distinct branch from 93acd98 to 7e67e2e on April 9, 2026 18:53

coderfender added 7 commits April 9, 2026 14:33

bitmap_smaller_datatypes

8d49dfe

bitmap_smaller_datatypes

0b179ff

bitmap_instead_of_hll_smaller_datatypes

c6095ab

bitmap_instead_of_hll_smaller_datatypes

9d06408

bitmap_instead_of_hll_smaller_datatypes

f185fdc

bitmap_instead_smaller_datatypes

f7c487a

bitmap_instead_smaller_datatypes

3f091d9

alamb reviewed Apr 10, 2026

Comment thread datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs Outdated

coderfender added 4 commits April 10, 2026 15:20

remove_boxing_smaller_int_types

5a2918a

Merge remote-tracking branch 'refs/remotes/origin/optimize_count_dist…

3f4952e

…inct' into optimize_count_distinct

remove_boxing_smaller_int_types

289b354

never_inline_bitmap_accumulators

554f60c

alamb reviewed Apr 11, 2026

Comment thread datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs

move_bitmap_accumulator_cold_path

87a9af8

martin-g reviewed Apr 13, 2026

Dandandan approved these changes Apr 14, 2026

Dandandan mentioned this pull request Apr 14, 2026

perf: Implement groups accumulator count distinct primitive types #21561

Open

alamb approved these changes Apr 14, 2026
